DOCUMENT RESUME

ED 201 665                                                  TM 810 258

AUTHOR          Frisbie, David A.
TITLE           A Method for Comparing Test Difficulties.
PUB DATE        Apr 81
NOTE            12p.; Paper presented at the Annual Meeting of the
                National Council on Measurement in Education (Los
                Angeles, CA, April 11-17, 1981).

EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     *Comparative Testing; *Difficulty Level; *Evaluation
                Methods; *Raw Scores; *Scaling; Test Items;
                *Transformations (Mathematics)
IDENTIFIERS     *Relative Difficulty Ratio

ABSTRACT
     The relative difficulty ratio (RDR) is used as a
method of representing test difficulty. The RDR is the ratio of a
test mean to the ideal mean, the point midway between the perfect
score and the mean chance score for the test. The RDR transformation
is a linear scale conversion method but not a linear equating method
in the classical sense. The transformation is used with non-parallel
tests, tests which may differ in both item format and length. The
goal of the transformation is not to "equate" for differences in test
difficulty. The purpose of the RDR transformation is to convert the
raw score scale from a given test so that the new scale has the same
mean chance point and the same range or length as the raw score scale
from a second test. The goal is to eliminate differential guessing
chance factors and unequal test lengths as competing explanations for
why two test means are different (or the same). A method for
transforming one of the raw score scales so that an inferential test
of mean differences can be accomplished though the original raw score
scales vary in mean chance score and/or length is described. (RL)



*   Reproductions supplied by EDRS are the best that can be made   *
*                  from the original document.                     *



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



A METHOD FOR COMPARING
TEST DIFFICULTIES



David A. Frisbie



"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY 



Measurement and Research Division
Office of Instructional Resources
University of Illinois at Urbana-Champaign



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 



Presented at the Annual Meeting of the 
National Council on Measurement in Education



Los Angeles 
April 1981 



A Method for Comparing Test Difficulties



Introduction

How can the difficulties (means) of two content-parallel tests be compared
if such tests differ in mean chance scores or test length? For example, the
means of a 100-item four-choice multiple choice test and a content-parallel
150-item true-false test cannot be compared meaningfully. Their chance scores are
25 and 75, respectively, and their lengths are in a two to three ratio. When
administered to the same or randomly equivalent groups, test reliabilities and
validity can be studied directly. But test difficulty cannot be addressed using
the mean raw scores.

When faced with this problem, some researchers have made inappropriate
comparisons which have yielded inaccurate conclusions while others have simply
identified the lack of comparability without drawing conclusions. Studies by
Ebel (1978), Frisbie (1973, 1974), and Mendelson, et al. (Note 1) are examples of
research aimed at discovering differences in properties of content-parallel tests
in which item format varied. A study by Hughes and Trimble (1965) illustrates how
some researchers vary the nature of distractors in multiple choice items which also
vary in the number of choices per item. Item format has been used as an independent
variable in studies where another independent variable may be confounded with item
format; Benson and Crocker (1979) and Huck (1978) are both examples.

In view of the obstacles to comparison presented by chance score differences
and variability in test lengths, Frisbie (in press) proposed the use of a relative
difficulty ratio (RDR) as a method of representing test difficulty. The RDR is
simply the ratio of a test mean to the ideal mean, the point midway between the
perfect score and the mean chance score for the test. The multiple choice and
true-false tests referred to above would have ideal means of 62.5 and 112.5.



These means are ideal in the sense that, for norm-referenced purposes, the overall
discrimination capabilities of the tests are maximized. A convenient scale
transformation can be applied to the basic definitional formula of RDR to arrive
at the computational formula which appears as equation one.

$$RDR = \frac{\bar{X} - X_I}{K - X_I} \tag{1}$$

The value of RDR using equation one, where $\bar{X} = 70.5$, $K = 100$, and
$X_I = 62.5$, with $K$ the number of items and $X_I$ the ideal mean, is .213.
Positive values of RDR are associated with tests which are too easy and negative
values are associated with tests which are too difficult in the norm-referenced
sense. The RDRs for two tests which are comparable in content could be compared
to determine the relative difficulty of either.
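
The definitions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from the paper: the function names are hypothetical, and the mean chance score is assumed to equal the number of items divided by the number of options per item, as in the 25 and 75 quoted in the introduction.

```python
def chance_score(n_items, n_options):
    """Mean chance score: expected number correct under blind guessing (assumed K / options)."""
    return n_items / n_options


def ideal_mean(n_items, n_options):
    """Point midway between the perfect score and the mean chance score."""
    return (n_items + chance_score(n_items, n_options)) / 2


def rdr(mean, n_items, n_options):
    """Relative difficulty ratio: (mean - ideal mean) / (items - ideal mean)."""
    x_i = ideal_mean(n_items, n_options)
    return (mean - x_i) / (n_items - x_i)


mc_ideal = ideal_mean(100, 4)   # 100-item four-choice test
tf_ideal = ideal_mean(150, 2)   # 150-item true-false test
mc_rdr = rdr(70.5, 100, 4)      # test mean of 70.5 from the text

print(mc_ideal, tf_ideal, round(mc_rdr, 3))   # 62.5 112.5 0.213
```

A positive result, as here, marks a test that is easier than ideal in the norm-referenced sense; a mean below the ideal mean would give a negative ratio.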

A significant limitation of using the RDR statistic for judging test difficulty
is that sampling fluctuation is ignored. The RDR is strictly descriptive;
probability statements about observed differences being statistically significant
cannot be made because the theoretical random sampling distribution of RDR has not
been identified. Another limitation of using RDR to make inferences is that the
RDR scale is an uncommon one, and differences are less easily interpreted than raw
score differences.

The purpose of this paper is to describe a method for transforming one of the
raw score scales so that an inferential test of mean differences can be accomplished,
though the original raw score scales vary in mean chance score and/or length. The
transformation using RDR overcomes the limitations cited above of using RDR alone
to make inferences. In addition, though the transformation is linear, it does not
accomplish linear equating as such equating is classically defined.






Theoretical Development 

For purposes of illustration, we assume that two tests, A and B, have
been developed so as to be content parallel in the general or specific sense
(i.e., items were written from sampling the same population of instructional
objectives or items were written in one format and then converted to another
format). Furthermore, A and B are choice-type tests with different mean chance
scores (i.e., the number of choice options varies) and unequal test lengths.
The problem is to statistically test the difference between the means of A and B,
derived from administering A and B to the same group or randomly equivalent groups.
To do so, the scores on Test A must be converted to the Test B score scale, or
vice versa. Then we can choose the appropriate statistical technique (e.g.,
t-test, ANOVA) in normal fashion for testing statistical significance.

The Transformation 

The scale transformation begins with equation 2, the RDR of the test scale on
which the transformation is to be performed, Test A.

$$RDR_A = \frac{\bar{X}_A - X_{IA}}{K_A - X_{IA}} \tag{2}$$

The RDR for Test B can be expressed in identical fashion by using the B subscript
instead of A. $RDR_B$ can be rewritten to solve for $\bar{X}_B$ as shown in
equation 3.

$$\bar{X}_B = RDR_B (K_B - X_{IB}) + X_{IB} \tag{3}$$

If Test B were to yield an RDR equal to $RDR_A$, what would the mean of Test B
equal? The mean of Test A can be transformed to the Test B score scale by
substituting $RDR_A$ in equation 3 for $RDR_B$. Equation four represents the
transformed mean, $\bar{X}'_A$, where the prime denotes a transformed value.

$$\bar{X}'_A = RDR_A (K_B - X_{IB}) + X_{IB} \tag{4}$$
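
As a worked illustration (not part of the original derivation), take the 100-item four-choice test from the introduction as Test A, with mean 70.5 and ideal mean 62.5, and the 150-item true-false test as Test B, with ideal mean 112.5. Pairing these two tests this way is an assumption made only for the example. Carrying the RDR of Test A over to the Test B scale gives:

```latex
RDR_A = \frac{\bar{X}_A - X_{IA}}{K_A - X_{IA}}
      = \frac{70.5 - 62.5}{100 - 62.5}
      = \frac{8}{37.5} \approx .213

\bar{X}'_A = RDR_A \,(K_B - X_{IB}) + X_{IB}
           = \frac{8}{37.5}\,(150 - 112.5) + 112.5
           = 8 + 112.5 = 120.5
```

A mean of 120.5 on the true-false scale sits as far above that test's ideal mean, relative to the distance from ideal mean to perfect score, as 70.5 does on the multiple choice scale.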






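To arrive at a computational formula for $\bar{X}'_A$, equation two is substituted
into equation four to yield

$$\bar{X}'_A = \frac{\bar{X}_A - X_{IA}}{K_A - X_{IA}} (K_B - X_{IB}) + X_{IB} \tag{5}$$

Through multiplication and algebraic simplification equation five can be written
as shown in equation six.

$$\bar{X}'_A = \left( \frac{K_B - X_{IB}}{K_A - X_{IA}} \right) \bar{X}_A + \left[ X_{IB} - X_{IA} \left( \frac{K_B - X_{IB}}{K_A - X_{IA}} \right) \right] \tag{6}$$

Equation six is written to show that the transformation is linear;
the mean of Test A is multiplied by a constant and an additional constant is
added to that product. All scores on Test A can be converted to the Test B
score scale via equation six by replacing $\bar{X}'_A$ with $X'_A$ and $\bar{X}_A$
with $X_A$. Equation seven shows that the standard deviation for Test A can be
converted to the Test B scale by dropping the second term of equation six.

$$S'_A = \left( \frac{K_B - X_{IB}}{K_A - X_{IA}} \right) S_A \tag{7}$$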
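
The linear conversion described above, multiplying by (K_B - X_IB) / (K_A - X_IA) and adding a constant, can be sketched as follows. The two tests, the mean of 28, and the standard deviation of 4 are hypothetical values chosen for round arithmetic; they do not come from the paper, and the function names are illustrative only.

```python
def ideal_mean(n_items, n_options):
    """Point midway between the perfect score and the mean chance score."""
    return (n_items + n_items / n_options) / 2


def conversion_constants(k_a, xi_a, k_b, xi_b):
    """Slope and intercept that carry Test A raw scores onto the Test B scale."""
    slope = (k_b - xi_b) / (k_a - xi_a)
    intercept = xi_b - slope * xi_a
    return slope, intercept


# Hypothetical tests (not from the paper): A has 40 five-choice items,
# B has 60 true-false items.
K_A, K_B = 40, 60
XI_A = ideal_mean(K_A, 5)   # (40 + 8) / 2 = 24.0
XI_B = ideal_mean(K_B, 2)   # (60 + 30) / 2 = 45.0

slope, intercept = conversion_constants(K_A, XI_A, K_B, XI_B)

mean_a_on_b = slope * 28 + intercept      # hypothetical Test A mean of 28
sd_a_on_b = slope * 4                     # hypothetical Test A SD of 4; no intercept
chance_a_on_b = slope * 8 + intercept     # Test A chance score (8)
perfect_a_on_b = slope * 40 + intercept   # Test A perfect score (40)

print(slope, intercept)                # 0.9375 22.5
print(mean_a_on_b, sd_a_on_b)          # 48.75 3.75
print(chance_a_on_b, perfect_a_on_b)   # 30.0 60.0
```

The last line shows the point of the conversion: the chance and perfect scores of Test A land exactly on the chance score (30) and perfect score (60) of Test B, so the converted scale shares Test B's chance point and length.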



The mean score difference, $\bar{X}'_A - \bar{X}_B$, can be tested for statistical
significance using whatever procedure would be appropriate had A and B not had
different mean chance scores or test lengths. The appropriate standard error can be
derived from $S_B$ and $S'_A$ or from the Test B raw scores and the transformed
Test A raw scores.

Assumptions 

The assumptions made in transforming $\bar{X}_A$ to $\bar{X}'_A$ include those
associated with RDR (Frisbie, in press): (1) a wrong response to an item is a random
guess from options in that item, (2) dichotomous scoring is used, and (3) there is
only one correct answer per item. In addition, if Test B is longer than Test A, the
transformation of $\bar{X}_A$ to $\bar{X}'_A$ requires that the assumptions
associated with the familiar Spearman-Brown Prophecy Formula be applied. That is,
the theoretically-lengthened test adds items which are equivalent in content
and difficulty to the original items and the added length would not contribute
to examinee fatigue or change in examinee psychological set.

Though these assumptions are commonly made in psychometric research,
the first, regarding random guessing, is troublesome to many measurement
specialists. Many would say that the assumption is unreasonable because, in
practice, it is always violated. Yet in making item format comparisons using
RDR, this assumption is applied systematically to both tests being compared.
There is little reason to expect that the effect of violating the random
guessing assumption would be much more influential on one item format than
another.

Further assumptions may need to be made, depending on the variability
of scores and number of items on the raw score scales to be compared. For
example, negative numbers may appear on the transformed scale if highly
variable scores on a long scale are transformed to a relatively short scale.
An interval measurement scale must be assumed in such cases. Negative score
values can be avoided by applying the transformation to the set of scores having
the smallest amount of variability. Non-integer values of transformed scores
should be retained so that precision is not lost if statistical tests are performed.
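
A small sketch shows how negative values can arise. The pairing below is hypothetical: a 150-item true-false test (ideal mean 112.5) is converted to the scale of a 12-item four-choice test (ideal mean 7.5), and the function name is illustrative only.

```python
def to_other_scale(score, k_from, xi_from, k_to, xi_to):
    """Linear RDR conversion of one raw score to the other test's scale."""
    slope = (k_to - xi_to) / (k_from - xi_from)
    return slope * (score - xi_from) + xi_to


# Hypothetical: long true-false scale converted to a short four-choice scale.
# Raw scores of 150 (perfect), 75 (chance), and 40 (well below chance).
for raw in (150, 75, 40):
    converted = to_other_scale(raw, 150, 112.5, 12, 7.5)
    print(raw, round(converted, 2))
# 150 12.0
# 75 3.0
# 40 -1.2
```

Under these assumed tests, the perfect and chance scores map to 12 and 3 as intended, but any raw score below 50 on the long scale converts to a negative number. Retaining the unrounded converted values, as the text recommends, avoids losing precision in later statistical tests.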

Discussion 

The RDR transformation is a linear scale conversion method but not a linear
equating method in the classical sense. The transformation is used with
non-parallel tests, tests which may differ in both item format and length. The
goal of the transformation is not to "equate" for differences in test difficulty.
The conversion process is not intended to yield equivalent scores as one might
accomplish by employing linear or equipercentile equating (Angoff, 1971). The
purpose of the transformation is to convert the raw score scale from a given
test so that the new scale has the same mean chance point and the same range or
length as the raw score scale from a second test. The goal is to eliminate
differential guessing chance factors and unequal test lengths as competing
explanations for why two test means are different (or the same). Though some
researchers have used a correction for guessing to address the chance score
problem, the test length problem still remained unresolved. The correction for
guessing is generally a less attractive solution because, if examinees
are informed about the correction, risk-taking behaviors and differential guessing
or omitting strategies tend to introduce sources of score invalidity. The use of
the RDR transformation requires no special scoring and no special directions to
examinees and it accounts for test length differences in the measures to be compared.

The data in Table 1 are artificial scores used to illustrate an application of
the RDR transformation. Assume that Test A has 12 four-choice multiple choice
items, Test B has 18 true-false items, and that these tests were administered
to chance halves of a group of 20 examinees. Ignoring the RDR transformation,
a t-test of the difference, $\bar{X}_A - \bar{X}_B$, is significant at
$\alpha < .005$ with nine degrees of freedom. This result leads to the erroneous
conclusion that Test A is more difficult than Test B.

[Insert Table 1 About Here]

On a purely descriptive basis, the RDRs shown in Table 1 indicate that
Test A is slightly easier than Test B. The equation at the bottom of the table
is used to calculate the mean for Test A as it would appear on the Test B scale.
When the difference, $\bar{X}'_A - \bar{X}_B$, is tested for statistical
significance, the result (t = 0.261) is not significant at a reasonably acceptable
level. The conclusion that Tests A and B do not differ in difficulty for the
population under consideration is warranted because the mean chance score
difference and the test length difference have been adjusted by the RDR
transformation.
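
The Table 1 comparison can be retraced with a short script. This is a sketch under stated assumptions, not the author's computation: an independent-groups t statistic with equal group sizes is assumed because the tests went to chance halves of the group; the Test B constants (3.5 and 14.5) are taken as printed in the equation at the bottom of Table 1 rather than re-derived; and the Test A ideal mean of 7.5 is assumed from the definition (midway between 12 and a chance score of 3).

```python
from math import sqrt
from statistics import mean, stdev

test_a = [10, 10, 10, 9, 9, 8, 8, 7, 7, 7]
test_b = [18, 18, 18, 17, 16, 16, 15, 13, 12, 7]

# Test A: ideal mean assumed from the definition, (12 + 3) / 2.
XI_A, RANGE_A = 7.5, 12 - 7.5
# Test B: ideal mean and (K - ideal mean) as printed in the Table 1 footer.
XI_B, RANGE_B = 14.5, 3.5

slope = RANGE_B / RANGE_A
test_a_on_b = [slope * (x - XI_A) + XI_B for x in test_a]


def t_independent(x, y):
    """t statistic for two independent groups of equal size."""
    assert len(x) == len(y)
    standard_error = sqrt((stdev(x) ** 2 + stdev(y) ** 2) / len(x))
    return (mean(x) - mean(y)) / standard_error


t_raw = t_independent(test_a, test_b)
t_converted = t_independent(test_a_on_b, test_b)
# Same statistic rebuilt from the rounded entries printed in Table 1.
t_from_rounded = (15.3 - 15) / sqrt((0.976 ** 2 + 3.50 ** 2) / 10)

print(round(mean(test_a_on_b), 1))   # 15.3
print(round(t_raw, 2))               # -5.53
print(round(t_converted, 2))         # 0.24
print(round(t_from_rounded, 3))      # 0.261
```

Under these assumptions the raw comparison gives a large negative t, while the converted comparison gives a value near zero. Computed from unrounded scores the converted statistic is about 0.24; the printed 0.261 is recovered when the rounded table entries (15.3, 0.976, 3.50) are used instead.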

The adjustments made by the RDR transformation are essential when
investigating the effect of item format on test difficulty. If such an effect
exists, then its impact on validity must be determined. Most measurement
practitioners assume that objective item formats are interchangeable; i.e.,
it makes no difference if multiple choice, true-false, or matching items are
used to measure achievement. If these formats do have a differential effect
on validity, then the circumstances under which each is optimally valid need
to be investigated. Such studies could begin by applying the RDR transformation
to item difficulties. The formula for item RDRs (Frisbie, in press) can be used
to derive item level calculations analogous to equations three through five
presented above. If item RDRs were to yield important differences, case study
methods might be useful for investigating the test taking process variables
which contribute to these differences.






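Table 1

Illustrations of the RDR Transformation

                      Test Scores
            ------------------------------
              X_A        X'_A        X_B
            ------------------------------
              10         16.4        18
              10         16.4        18
              10         16.4        18
               9         15.7        17
               9         15.7        16
               8         14.9        16
               8         14.9        15
               7         14.1        13
               7         14.1        12
               7         14.1         7
            ------------------------------
    K         12         18          18
    Mean       8.5       15.3        15
    S          1.26       0.976       3.50
    RDR         .222       .222        .143
            ------------------------------

    $\bar{X}'_A = (.222)(3.5) + 14.5$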








Reference Notes

1. Mendelson, M. A., et al. The effect of format on the difficulty of
   multiple-completion test items. Paper presented at the meeting of the
   National Council on Measurement in Education, Boston, April 1980.






References 

Angoff, W. H. Scales, norms, and equivalent scores. In Thorndike, R. L. (Ed.),
     Educational Measurement. Washington, D.C.: American Council on Education,
     1971.

Benson, J. & Crocker, L. The effects of item format and reading ability on
     objective test performance: A question of validity. Educational and
     Psychological Measurement, 1979, 39, 381-387.

Ebel, R. L. The ineffectiveness of multiple true-false test items.
     Educational and Psychological Measurement, 1978, 38, 37-44.

Frisbie, D. A. Multiple choice vs. true-false: A comparison of reliabilities
     and concurrent validities. Journal of Educational Measurement, 1973, 10,
     297-304.

Frisbie, D. A. The effect of item format on reliability and validity: A study
     of multiple choice and true-false achievement tests. Educational and
     Psychological Measurement, 1974, 34, 885-892.

Frisbie, D. A. The relative difficulty ratio - a test and item index.
     Educational and Psychological Measurement, in press.

Huck, S. W. Test performance under the condition of known item difficulty.
     Journal of Educational Measurement, 1978, 15, 53-58.

Hughes, H. H. & Trimble, W. E. The use of complex alternatives in multiple
     choice items. Educational and Psychological Measurement, 1965, 25, 117-125.






